The Big Picture

Before digging into the details of policy gradient methods, we'll discuss how they work at a high level.

## Quiz

Which of the following are true about how the policy gradient method will work? (Select all that apply.)

For each episode, if the agent won the game, we'll amend the policy network weights to make each (state, action) pair that appeared in the episode more likely to repeat in future episodes.

For each episode, we randomly nudge the policy weights a bit by adding Gaussian noise, and if the agent did better, we keep the new weights; otherwise, we revert to the old weights.

For each episode, if the agent lost the game, we'll change the policy network weights to make it less likely to repeat the corresponding (state, action) pairs in future episodes.

After each time step, if we arrive at a terminal state, we add noise to the policy network weights.

SOLUTION:

For each episode, if the agent won the game, we'll amend the policy network weights to make each (state, action) pair that appeared in the episode more likely to repeat in future episodes.
For each episode, if the agent lost the game, we'll change the policy network weights to make it less likely to repeat the corresponding (state, action) pairs in future episodes.
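
To make the two correct statements concrete, here is a minimal sketch of the idea, not the course's implementation. It assumes a tiny tabular softmax policy in place of a policy network, and an episode outcome encoded as +1 for a win and -1 for a loss. The names `softmax`, `update_policy`, `theta`, and `lr` are hypothetical and chosen only for illustration.

```python
import math


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def update_policy(theta, episode, outcome, lr=0.1):
    """Return new preferences after one episode.

    theta:   dict mapping state -> list of action preferences (assumed layout)
    episode: list of (state, action) pairs visited in the episode
    outcome: +1 if the agent won, -1 if it lost (assumed encoding)
    """
    new_theta = {s: list(p) for s, p in theta.items()}
    for state, action in episode:
        probs = softmax(theta[state])
        for a, p in enumerate(probs):
            # Gradient of log pi(action | state) w.r.t. preference a
            # for a softmax policy: 1[a == action] - pi(a | state).
            grad_log_prob = (1.0 if a == action else 0.0) - p
            # Win: step up the gradient (pair becomes more likely).
            # Loss: step down the gradient (pair becomes less likely).
            new_theta[state][a] += lr * outcome * grad_log_prob
    return new_theta


theta = {"s0": [0.0, 0.0], "s1": [0.0, 0.0]}
episode = [("s0", 1), ("s1", 0)]

after_win = update_policy(theta, episode, outcome=+1)
after_loss = update_policy(theta, episode, outcome=-1)

print(softmax(after_win["s0"])[1] > 0.5)   # action 1 in s0 is now more likely
print(softmax(after_loss["s0"])[1] < 0.5)  # action 1 in s0 is now less likely
```

Note that the update depends on which (state, action) pairs actually appeared in the episode and on whether the episode was won or lost. This is what separates it from the noise-based options in the quiz, which perturb the weights without using that information.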